Knowledge-Based Automatic Topic Identification

Author

  • Chin-Yew Lin
Abstract

As the first step in an automated text summarization algorithm, this work presents a new method for automatically identifying the central ideas in a text based on a knowledge-based concept counting paradigm. To represent and generalize concepts, we use the hierarchical concept taxonomy WordNet. By setting appropriate cutoff values for such parameters as concept generality and child-to-parent frequency ratio, we control the amount and level of generality of concepts extracted from the text.

Footnote 1: This research was funded in part by ARPA under order number 8073, issued as Maryland Procurement Contract # MDA904-91-C-5224, and in part by the National Science Foundation Grant No. MIP 8902426.

1 Introduction

As the amount of text available online keeps growing, it becomes increasingly difficult for people to keep track of and locate the information of interest to them. To remedy the problem of information overload, a robust and automated text summarizer or information extractor is needed. Topic identification is one of two very important steps in the process of summarizing a text; the second step is summary text generation.

A topic is a particular subject that we write about or discuss (Sinclair et al., 1987). To identify the topics of texts, Information Retrieval (IR) researchers use word frequency, cue word, location, and title-keyword techniques (Paice, 1990). Among these techniques, only word frequency counting can be used robustly across different domains; the other techniques rely on stereotypical text structure or the functional structures of specific domains.

Underlying the use of word frequency is the assumption that the more a word is used in a text, the more important it is in that text. This method recognizes only the literal word forms and nothing else. Some morphological processing may help, but pronominalization and other forms of coreferentiality defeat simple word counting.
Furthermore, straightforward word counting can be misleading since it misses conceptual generalizations. For example: "John bought some vegetables, fruit, bread, and milk." What would be the topic of this sentence? Word counting yields no conclusion here, whereas the topic should actually be: "John bought some groceries." The problem is that the word counting method misses the important concepts behind those words: vegetables, fruit, etc. relate to groceries at the deeper level of semantics.

Recognizing this inherent problem of the word counting method, researchers have recently started to use artificial intelligence techniques (Jacobs and Rau, 1990; Mauldin, 1991) and statistical techniques (Salton et al., 1994; Grefenstette, 1994) to incorporate the semantic relations among words into their applications. Following this trend, we have developed a new way to identify topics by counting concepts instead of words.

2 The Power of Generalization

In order to count concept frequency, we employ a concept generalization taxonomy. Figure 1 shows a possible hierarchy for the concept digital computer. According to this hierarchy, if we find laptop and hand-held computer in a text, we can infer that the text is about portable computers, which is their parent concept. If, in addition, the text also mentions workstation and mainframe, it is reasonable to say that the topic of the text is related to digital computer.

Using a hierarchy, the question is now how to find the most appropriate generalization. Clearly we cannot just use the leaf concepts, since at this level we have gained no power from generalization. On the other hand, neither can we use the very top concept: everything is a thing. We need a method of identifying the most appropriate concepts somewhere in the middle of the taxonomy. Our current solution uses
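The generalization idea described in this excerpt can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose description is cut off here. The toy taxonomy `PARENT`, the function names, and the `ratio_cutoff` value of 0.7 are all assumptions made for illustration; the paper itself uses WordNet. Each mention is credited to its concept and to all of its ancestors, and the topic is chosen by descending from the root for as long as a single child accounts for most of its parent's count.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent), loosely following the
# "digital computer" example in the text; it is not taken from WordNet.
PARENT = {
    "laptop": "portable computer",
    "hand-held computer": "portable computer",
    "portable computer": "digital computer",
    "workstation": "digital computer",
    "mainframe": "digital computer",
    "digital computer": "thing",
}


def concept_counts(mentions):
    """Credit each mention to its own concept and to every ancestor."""
    counts = Counter()
    for concept in mentions:
        node = concept
        while node is not None:
            counts[node] += 1
            node = PARENT.get(node)  # None once we pass the root
    return counts


def children(node):
    """Return the direct children of a node in the toy taxonomy."""
    return [child for child, parent in PARENT.items() if parent == node]


def pick_topic(counts, root="thing", ratio_cutoff=0.7):
    """Descend from the root while one child dominates its parent.

    ratio_cutoff is an assumed child-to-parent frequency ratio threshold:
    if no child reaches it, the current node is taken as the topic.
    """
    node = root
    while True:
        kids = [c for c in children(node) if counts[c] > 0]
        if not kids:
            return node  # reached a leaf or an unmentioned subtree
        best = max(kids, key=lambda c: counts[c])
        if counts[best] / counts[node] < ratio_cutoff:
            return node  # mentions are spread over several children
        node = best


portable = concept_counts(["laptop", "hand-held computer"])
print(pick_topic(portable))

mixed = concept_counts(
    ["laptop", "hand-held computer", "workstation", "mainframe"]
)
print(pick_topic(mixed))
```

With this toy data, the first call selects portable computer and the second selects digital computer, matching the two cases discussed in the text. A lower cutoff pushes the result toward more specific concepts, and a higher one toward more general concepts.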

Similar resources

Automatic Identification and Classification of the Iranian Traditional Music Scales (Dastgāh) and Melody Models (Gusheh): Analytical and Comparative Review on Conducted Research

Background and Aim: Automatic identification and classification of the Iranian traditional music scales (Dastgāh) and melody models (Gusheh) has attracted the attention of researchers for more than a decade. The current research aims to review the research conducted in this area and to consider its different approaches and obstacles. Method: The research approach is content analysis and data col...

Using Encyclopedic Knowledge for Automatic Topic Identification

This paper presents a method for automatic topic identification using an encyclopedic graph derived from Wikipedia. The system is found to exceed the performance of previously proposed machine learning algorithms for topic identification, with an annotation consistency comparable to human annotations.

Signal Identification Using a New High Efficient Technique

Automatic signal type identification (ASTI) is an important topic for both the civilian and military domains. Most of the proposed identifiers can only recognize a few types of digital signal and usually need high SNR levels. This paper presents a new, highly efficient technique that includes a variety of digital signal types. In this technique, a combination of higher order moments and hi...

Kohonen Self Organizing for Automatic Identification of Cartographic Objects

Automatic identification and localization of cartographic objects in aerial and satellite images have gained increasing attention in recent years in digital photogrammetry and remote sensing. Although the automatic extraction of man-made objects is in essence still an unresolved issue, man-made objects can be extracted from aerial photos and satellite images. Recently, the high-resolution s...

Topic based classification and pattern identification in patents

Patent classification systems and citation networks are used extensively in innovation studies. However, non-unique mapping of classification codes onto specific products/markets and the difficulties in accurately capturing knowledge flows based just on citati...

Ontology based Web Page Topic Identification

With the emergence of the web, many research efforts have been made in the area of Web Mining. This paper proposes an automatic approach for topic identification from web pages. The contribution of this research is an approach to automatic topic identification of web pages that can provide better results. The topic of the web documents is identified through an ontological approach.

Publication date: 1995